Molecular standards for microbial pathogens

ABSTRACT

A method for constructing a consensus sequence from a sequence alignment. The consensus sequence may be used to generate molecular standards that may substitute for genomic DNA in various assays. Since a molecular standard cannot have unresolved bases, the method removes less informative sequences to resolve all positions in the alignment. Also includes several sequences from pathogenic waterborne species that were constructed according to the method.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims priority to and is a continuation of U.S. patent application Ser. No. 12/870,481, filed Aug. 27, 2010, and claims the benefit of U.S. Provisional Patent Application No. 61/237,933, filed on Aug. 28, 2009, the disclosures of which are expressly incorporated by reference herein in their entireties.

STATEMENT OF GOVERNMENT INTEREST

The United States Government may own rights in the present disclosure pursuant to NIH 2 R42 AI069598-02 and NSF 0945221.

BACKGROUND

1. Field of the Present Disclosure

The present disclosure provides a library of synthetic standard molecules for multiple species of microbial pathogens, including Cryptosporidium, Giardia, and microsporidia. Each of these standard molecules includes a bacterial plasmid molecule containing a specific DNA sequence insert that represents a consensus sequence of the 18S rRNA gene for a single species of interest. These standard molecules may be used by, for example, researchers, utility operators, and clinical laboratory technicians as a surrogate for native genomic DNA in a variety of situations.

2. Related Art

Due to technological limitations, environmental and clinical laboratories are increasingly moving away from microscopic methods and towards molecular detection methods. Molecular methods typically use the polymerase chain reaction (PCR) to detect a specific DNA sequence in the genome of a target organism. Compared to microscopic methods, molecular methods offer increased speed, sensitivity, and reproducibility. Molecular methods can also provide supplementary data unattainable using microscopy, such as, for example, genotype identification.

One microbial pathogen of particular interest is Cryptosporidium. Fifteen waterborne Cryptosporidium outbreaks were reported in the United States between 1991 and 2002, affecting over 408,000 individuals. This makes Cryptosporidium the leading cause of waterborne disease by number of affected individuals. The most significant outbreak occurred in Milwaukee, Wis. in 1993. This well-studied case affected over 403,000 individuals and cost the region an estimated $96.2 million. This event, plus several major recreational outbreaks since then, underscores the importance of proper water monitoring.

Giardia contamination can cause outbreaks that result in similar disruptions.

Microsporidia, including Encephalitozoon intestinalis and Enterocytozoon bieneusi, cause microsporidiosis, which is an opportunistic infection that can cause diarrhea and wasting in immunocompromised patients.

Unfortunately, the introduction of new molecular tools for targets such as, e.g., Cryptosporidium and Giardia has been restricted by the lack of standardized positive controls. Positive controls, typically purified genomic DNA from pathogens of interest, may have multiple roles in the development and validation of a molecular method. Two of the most important roles include:

-   As a sensitivity control, determining detection limits and quantifying target DNA; and
-   As a specificity control, resolving target genotypes.

It can be extremely challenging to obtain positive controls for microbial pathogens of environmental interest. Many such organisms are difficult to culture in vitro. The distribution of others is regulated by the Centers for Disease Control and/or the United States Department of Commerce. Researchers who wish to develop new tests for these pathogens must often perform their own isolations from clinical or environmental samples or obtain specimens from collaborators. These research stocks are often subject to inconsistent quality control, increase the risk of laboratory-associated infections, and are of insufficient quantity for industrial-scale development and validation. Until an alternative source can be developed, the limited availability of positive controls threatens to prevent the introduction of any new molecular tests for microbial pathogens into the market.

In the case of Cryptosporidium, limited amounts of positive control DNA are available from a handful of Biological Resource Centers (BRCs). In particular, Waterborne supplies purified C. parvum and C. muris oocysts, while AMERICAN TYPE CULTURE COLLECTION™ (ATCC) can regularly provide researchers with genomic DNA from C. parvum (Iowa strain). A third commercial source, the Biodefense and Emerging Infections Research Resources Repository (BEI Resources), supplies genomic DNA and other reagents only to NIH-funded investigators. These supplies are insufficient for widespread method development, especially for assays aimed at distinguishing multiple genotypes. As a result, test development has been fragmented as research groups rely on various organism stocks of inconsistent quality.

Accordingly, there exists a pressing need for standardized positive controls for Cryptosporidium, Giardia, microsporidia, and other microbial pathogens that may be used to develop and validate molecular detection and genotyping methods.

SUMMARY OF THE PRESENT DISCLOSURE

The present disclosure meets the foregoing need and allows detection of pathogenic species using molecular methods, which results in a significant improvement in speed, sensitivity, and reproducibility and other advantages apparent from the discussion herein.

Accordingly, in one aspect of the present disclosure, a method is described for constructing a consensus sequence from an alignment of two or more nucleic acid sequences. The method includes iterating over each position in the alignment and taking the following actions at each position: (1) calculate the base frequencies and determine the base with the highest frequency; (2) if the frequency of the most common base is greater than a specified frequency threshold, then the base is assigned to that position in the consensus sequence; and (3) if the frequency of the most common base is below the frequency threshold, then the base corresponding to the nucleic acid sequence with the lowest information score is removed and the process repeats from action (1).

The method may include generating a frequency matrix, which includes the frequency of each base at each position in the alignment; creating an information matrix, which includes the amount of information provided by each base at each position in the alignment; and calculating an information score for each nucleic acid sequence. As part of creating an information matrix, the method may calculate the decrease in Shannon uncertainty for each base at each position in the alignment. As part of calculating an information score, the method may sum the decreases in Shannon uncertainty for each base in each sequence. Insertions and deletions may be removed from the multiple sequence alignment. The frequency threshold for actions (2) and (3) may be 0.7.

The method may be used to construct a consensus sequence. A restriction fragment length polymorphism (RFLP) fingerprint of the constructed consensus sequence may be compared to RFLP fingerprints of one or more of the nucleic acid sequences in the multiple sequence alignment. Binding of oligonucleotides to the consensus sequence and to sequences in the multiple sequence alignment may be compared on the basis of Gibbs free energy of hybridization, melting temperature of the heterodimer, and binding position. The consensus sequence may be used to synthesize a DNA construct or molecular standard. The DNA construct may be linear, or it may be circular, e.g., a plasmid.

According to another aspect of the present disclosure, a multiple sequence alignment, which includes a number of alignment positions, is used to construct a consensus sequence. As part of this method, a frequency matrix, which includes the frequency of each base at each alignment position, is generated. An information matrix, which includes the amount of information provided by each base at each alignment position, is also generated, and an information score is calculated for each sequence in the multiple sequence alignment. The method iterates over the alignment and, at each alignment position, does the following: (1) determines which base at the alignment position has the highest frequency; (2) if the frequency of the highest frequency base is above a threshold value, assigns the base to the consensus sequence; and (3) if the frequency of the highest frequency base is below the threshold, removes the base corresponding to the sequence with the lowest information score, recalculates the base frequencies, and returns to action (1).

As part of creating an information matrix, the method may calculate the decrease in Shannon uncertainty for each base at each position in the alignment. As part of calculating an information score, the method may sum the decreases in Shannon uncertainty for each base in each sequence. Insertions and deletions may be removed from the multiple sequence alignment. The frequency threshold for actions (2) and (3) may be 0.7.

The method may be used to construct a consensus sequence. A restriction fragment length polymorphism (RFLP) fingerprint of the constructed consensus sequence may be compared to RFLP fingerprints of one or more of the nucleic acid sequences in the multiple sequence alignment. Binding of oligonucleotides to the consensus sequence and to sequences in the multiple sequence alignment may be compared on the basis of Gibbs free energy of hybridization, melting temperature of the heterodimer, and binding position. The consensus sequence may be used to synthesize a DNA construct or molecular standard. The DNA construct may be linear, or it may be circular, e.g., a plasmid.

According to yet another aspect of the present disclosure, 18S rRNA consensus sequences are disclosed for Cryptosporidium andersoni, as shown in SEQ ID NO:1; Cryptosporidium baileyi, as shown in SEQ ID NO:2; Cryptosporidium bovis, as shown in SEQ ID NO:3; Cryptosporidium canis, as shown in SEQ ID NO:4; Cryptosporidium felis, as shown in SEQ ID NO:5; Cryptosporidium hominis, as shown in SEQ ID NO:6; Cryptosporidium meleagridis, as shown in SEQ ID NO:7; Cryptosporidium muris, as shown in SEQ ID NO:8; Cryptosporidium parvum, as shown in SEQ ID NO:9; Cryptosporidium serpentis, as shown in SEQ ID NO:10; Cryptosporidium wrairi, as shown in SEQ ID NO:11; Giardia intestinalis, as shown in SEQ ID NO:12; Encephalitozoon intestinalis, as shown in SEQ ID NO:13; and Enterocytozoon bieneusi, as shown in SEQ ID NO:14.

Additional features, advantages, and embodiments of the present disclosure may be set forth or apparent from consideration of the following detailed description, drawings, and claims. Moreover, it is to be understood that both the foregoing summary of the present disclosure and the following detailed description are exemplary and intended to provide further explanation without limiting the scope of the present disclosure as claimed.

BRIEF DESCRIPTION OF THE DRAWINGS

The accompanying drawings, which are included to provide a further understanding of the present disclosure, are incorporated in and constitute a part of this specification, illustrate embodiments of the present disclosure and together with the detailed description serve to explain the principles of the present disclosure. No attempt is made to show structural details of the present disclosure in more detail than may be necessary for a fundamental understanding of the present disclosure and the various ways in which it may be practiced. In the drawings:

FIG. 1 shows standard curves of C. hominis genomic and synthetic target DNA in real-time PCR assays;

FIG. 2 shows standard curves of C. meleagridis genomic and synthetic target DNA in real-time PCR assays;

FIG. 3 shows standard curves of C. parvum genomic and synthetic target DNA in real-time PCR assays;

FIG. 4 shows standard curves of C. muris genomic and synthetic target DNA in real-time PCR assays;

FIG. 5 shows standard curves of G. intestinalis genomic and synthetic target DNA in real-time PCR assays; and

FIG. 6 shows standard curves of C. felis synthetic target DNA in real-time PCR assays.

DETAILED DESCRIPTION

It is understood that the present disclosure is not limited to the particular methodology, protocols, and reagents, etc., described herein, as these may vary as the skilled artisan will recognize. It is also to be understood that the terminology used herein is used for the purpose of describing particular embodiments only, and is not intended to limit the scope of the present disclosure. It is also to be noted that as used herein and in the appended claims, the singular forms “a,” “an,” and “the” include the plural reference unless the context clearly dictates otherwise. Thus, for example, a reference to “a capsule” is a reference to one or more capsules and equivalents thereof known to those skilled in the art.

Unless defined otherwise, all technical and scientific terms used herein have the same meanings as commonly understood by one of ordinary skill in the art to which the present disclosure pertains. The embodiments of the present disclosure and the various features and advantageous details thereof are explained more fully with reference to the non-limiting embodiments and/or illustrated in the accompanying drawings and detailed in the following description. It should be noted that the features illustrated in the drawings are not necessarily drawn to scale, and features of one embodiment may be employed with other embodiments as the skilled artisan would recognize, even if not explicitly stated herein.

Any numerical values recited herein include all values from the lower value to the upper value in increments of one unit provided that there is a separation of at least two units between any lower value and any higher value. As an example, if it is stated that the concentration of a component or value of a process variable such as, for example, size, temperature, pressure, time and the like, is, for example, from 1 to 90, specifically from 20 to 80, more specifically from 30 to 70, it is intended that values such as 15 to 85, 22 to 68, 43 to 51, 30 to 32 etc., are expressly enumerated in this specification. For values which are less than one, one unit is considered to be 0.0001, 0.001, 0.01 or 0.1 as appropriate. These are only examples of what is specifically intended and all possible combinations of numerical values between the lowest value and the highest value enumerated are to be considered to be expressly stated in this application in a similar manner.

Moreover, provided immediately below is a “Definition” section, where certain terms related to the present disclosure are defined specifically. Particular methods, devices, and materials are described, although any methods and materials similar or equivalent to those described herein can be used in the practice or testing of the present disclosure. All references referred to herein are incorporated by reference herein in their entirety.

1. DEFINITIONS

The terms “alignment” and “sequence alignment” as used herein refer to arrangement of two or more nucleic acid sequences that may be used to identify regions of similarity between the sequences. If the sequences are displayed horizontally, then the individual bases from different sequences are arranged in vertical columns, which may be referred to as “alignment positions”.

The term “base” as used herein refers to a single monomer of a nucleic acid.

The term “base frequency” as used herein refers to the frequency with which a given base appears in a particular grouping of bases, such as a nucleic acid sequence or an alignment position.

The term “consensus sequence” as used herein refers to a representation of a sequence alignment that

The term “Cryptosporidium” as used herein by itself, not followed by a species name, means any species of Cryptosporidium which is known to cause disease, including, for example, C. parvum, C. felis, C. muris, C. meleagridis, C. suis, C. canis, and/or C. hominis.

The term “DNA construct” as used herein refers to an artificially constructed segment of nucleic acid.

The term “Giardia” as used herein by itself, not followed by a species name, means any species of Giardia which is known to cause disease. This may include, for example, G. lamblia, G. duodenalis, and/or G. intestinalis.

The term “microsporidia” as used herein refers to any species of microsporidia which is known to cause disease, including, e.g., E. intestinalis and/or E. bieneusi.

The term “nucleic acid,” as used herein, may include an oligonucleotide, nucleotide, or polynucleotide, and fragments thereof. The term may refer to DNA or RNA of genomic or synthetic origin, which may be single- or double-stranded and may represent the sense or antisense strand. Additionally, the term may refer to peptide nucleic acid (PNA), to a small interfering RNA (siRNA) molecule, or to any DNA-like or RNA-like material, natural or synthetic in origin.

The term “nucleic acid sequence” as used herein refers to the specific order of monomers in a nucleic acid molecule that includes two or more monomers.

The term “PCR” as used herein means the polymerase chain reaction, as is well-known in the art. The term includes all forms of PCR, such as, e.g., real-time PCR and quantitative PCR.

The term “plasmid” as used herein refers to a circular nucleic acid molecule that is separate from a cell's chromosome(s) and may replicate independently of the chromosome(s).

The terms “restriction fragment length polymorphism” and “RFLP” as used herein refer to a difference between two or more nucleic acid samples. Differences in sequence between the samples result in different endonuclease restriction (cutting) sites, which in turn produce fragments of different length after digestion by a particular endonuclease. The particular pattern of fragments that a sample produces may be referred to as an “RFLP fingerprint.”

2. DESCRIPTION

Molecular methods are increasingly used for the detection of pathogens, due to the superiority of these methods over traditional microscopic methods. Molecular tools for many pathogenic species, however, may be unavailable because the organisms are difficult to culture in vitro, resulting in a lack of standardized positive controls. Consequently, researchers often rely on their own isolations, which can vary dramatically in quality. Development of molecular methods is fragmented as different groups rely on organism stocks of inconsistent quality. Thus, there is a need for standardized positive controls for these organisms, such as, e.g., Cryptosporidium, Giardia, and microsporidia, that can be used to develop and validate molecular detection and genotyping methods.

One promising solution may be found in the advancing field of synthetic biology. Synthetic biology involves using engineering tools to generate biological components de novo from DNA sequences. Much of this work relies on recent improvements in chemical DNA synthesis by third-party manufacturers.

One aspect of the present disclosure is directed to a robust workflow for designing synthetic positive controls. This workflow has been employed to produce consensus sequences for Cryptosporidium hominis, C. meleagridis, C. felis, C. parvum, C. muris, C. andersoni, C. baileyi, C. bovis, C. canis, C. serpentis, C. wrairi, Giardia intestinalis, Encephalitozoon intestinalis, and Enterocytozoon bieneusi. Additionally, molecular standards have been produced and tested for C. felis, C. parvum, C. muris, C. hominis, C. meleagridis, and G. intestinalis. Each of these molecular standards may include a bacterial plasmid molecule containing a synthetically-produced DNA insert, the sequence of which may represent the 18S rRNA gene of a single Cryptosporidium, Giardia, or microsporidia species. These molecular standards may be used as a surrogate for native genomic DNA in a variety of situations.

This approach has a number of advantages over traditional sources of positive control DNA, including, for example, the following:

-   The efficiency of chemical DNA synthesis allows rapid prototyping and validation.
-   Synthetic standards are extremely stable, extending storage life and ensuring high-quality analytical results.
-   Synthetic standards can be designed to allow use by multiple research teams to develop and validate their molecular assays.
-   The design and production of synthetic standards can be subjected to precise quality control.
-   Synthetic standards present a lower risk of laboratory contamination than live organism cultures, allowing easier distribution and use by academic, commercial, and educational groups.

The design workflow described in this disclosure includes six parts:

1. Identify relevant reference sequences for the target gene.
2. Align the reference sequences using one of several existing software applications.
3. Reduce the multiple sequence alignment into a single consensus sequence using a novel algorithm.
4. Computationally verify that the consensus sequence exhibits the same properties as one or more reference sequences.
5. Synthesize the consensus sequence and incorporate it into a bacterial plasmid to create a molecular standard.
6. Optimize PCR performance of the molecular standard.

Each of these parts will now be described in detail.

Part 1: Identify Relevant Reference Sequences.

For most pathogens of interest, there is a general consensus in the scientific literature as to which gene represents the most appropriate target for molecular assays. In the case of Cryptosporidium, Giardia, and microsporidia, this gene is 18S rRNA. However, alternative genes may be used in these or other species without departing from the spirit or scope of the present disclosure. The selected gene or genes may or may not include additional flanking sequences. The first step in the design workflow is to collect as many 18S rRNA reference sequences for the species of interest as possible. Many of these are available on public databases, such as, for example, GENBANK™. However, other sequences may be obtained as part of a private sequencing effort.

Some reference sequences may be more valuable than others. For example, sequences that were obtained many years ago may reflect incorrect naming conventions or include inaccurate information. These distinctions should be made to ensure that only the most accurate reference sequences are used to design the molecular standards. Ideally, at the end of Part 1, a list of reference sequences should have been determined. The list should represent the efforts of multiple sequencing groups, at different geographic locations, analyzing multiple strains or isolates.

Part 2: Align Reference Sequences Using One of Several Multiple Sequence Alignment Applications.

In order for the reference sequences to be useful, they should first be aligned to each other. Although each sequence may be from the same species, they all will likely be from different isolates or sub-species. Each of these sub-species may have subtle differences in the sequence of their 18S rRNA gene. Also, due to variability in sequencing equipment and protocols, each reference sequence may likely be of a unique length. Long sequences may be likely to contain many different patterns, while smaller sequences may be likely to contain only a subset of these patterns. By aligning these sequences, it is possible to identify regions of similarity between some or all of the sequences.

For example, an unaligned set of four sequences may be represented as:

SEQ ID NO: 15   ACTGGTAGCTAGCCTGGATCGATCGGGTGTAGTACTGA
SEQ ID NO: 16   TAGCCTGGATCCATCG
SEQ ID NO: 17   TATTACTGA
SEQ ID NO: 18   TAGGTAGCCTGGATC

The alignment of these four sequences may be represented as:

SEQ ID NO: 19   ACTGGTAGCTAGCCTGGATCGATCGGGTGTAGTACTGA
SEQ ID NO: 20   ---------TAGCCTGGATCCATCG-------------
SEQ ID NO: 21   -----------------------------TATTACTGA
SEQ ID NO: 22   -----TAGGTAGCCTGGATC------------------

Many third-party software applications exist to perform multiple sequence alignments, including, for example, MUSCLE, MAFFT, MACAW, T-Coffee, and CLUSTAL. Alignments used in the development of the present disclosure were obtained using CLUSTAL. The output of CLUSTAL included a text file containing a multiple sequence alignment of all reference sequences identified during Part 1.

Part 3: Reduce the Multiple Sequence Alignment to a Single Consensus Sequence.

The multiple sequence alignment obtained during Part 2 may represent all of the various sequences a researcher might obtain if he or she sequenced the 18S rRNA gene of the Cryptosporidium, Giardia, or microsporidia species of interest from a particular source. However, it may not be possible to determine which of the reference sequences may be the most representative of the researcher's particular isolate. Therefore, it may not be possible to simply select an arbitrary reference sequence and use it as the basis for a molecular standard because the arbitrarily selected sequence may not bear enough similarities to the researcher's isolate.

Instead, a single consensus sequence may be generated from the multiple sequence alignment. This consensus sequence may not be identical to any of the reference sequences. However, it may be closely similar to all of them. In this way, it may be possible to ensure that the molecular standard may be useable by all researchers working with the species of interest, regardless of their particular isolate.

The consensus sequence determination may be made using a multi-part computational algorithm:

Part 3a: Remove Base Pair Inserts and Deletions.

During the alignment of the DNA sequences as outlined in Part 2, the alignment software may discover an insert (an unnecessary base) or a deletion (a missing base) in one or more sequences. These inserts and deletions should be removed before further analysis. For example, consider a sequence alignment represented as:

SEQ ID NO: 23   TATCAACAT_CCTTCCTATTATATTTCT
SEQ ID NO: 24   TATCAACAT_CCTTCCTATTATAT_TCT
SEQ ID NO: 25   TATCAACATTCCTTCCTATTATATTTCT
SEQ ID NO: 26   TATCGACAT_CCTTCCTATTATATATCT

At position 10 of the example alignment, sequence 3 (SEQ ID NO: 25) has a base insert that does not exist in any other sequence. In a method according to the present disclosure, this inserted base may be removed. Sequence 2 (SEQ ID NO: 24) has a deletion that occurs at position 25. Because the other sequences all contain a base at that position, the bases at position 25 may be retained rather than removed.
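The column-filtering idea in Part 3a can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `drop_insert_columns` and the majority rule (a column is dropped when gaps outnumber bases) are assumptions made here, since the disclosure describes the behavior by example rather than stating an exact cutoff.

```python
def drop_insert_columns(alignment, gap_chars="-_"):
    """Drop alignment columns in which gaps outnumber bases.

    Hypothetical helper: the majority rule is an assumption, not a rule
    stated in the disclosure.
    """
    n_rows = len(alignment)
    keep = []
    for col in range(len(alignment[0])):
        n_gaps = sum(1 for row in alignment if row[col] in gap_chars)
        if n_gaps * 2 <= n_rows:  # bases are at least as common as gaps
            keep.append(col)
    return ["".join(row[c] for c in keep) for row in alignment]


# The four-sequence example alignment from Part 3a.
example = [
    "TATCAACAT_CCTTCCTATTATATTTCT",
    "TATCAACAT_CCTTCCTATTATAT_TCT",
    "TATCAACATTCCTTCCTATTATATTTCT",
    "TATCGACAT_CCTTCCTATTATATATCT",
]
cleaned = drop_insert_columns(example)
for row in cleaned:
    print(row)
```

Under this assumed rule, the column at position 10 (three gaps, one base) is dropped, while the column at position 25 (one gap, three bases) is kept with its gap in place.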

Part 3b: Create Frequency Matrix.

A frequency matrix may be generated, containing the frequency of each base at each alignment position. For example, consider 5 reference sequences aligned in the following way:

Position #   1   2   3
Sequence 1   A   C   T
Sequence 2   A   C   A
Sequence 3   T   G   C
Sequence 4   A   G   C
Sequence 5   A   T   G

In this case, the frequency matrix would be:

Position #   1     2     3
f(A):        0.8   0     0.2
f(T):        0.2   0.2   0.2
f(C):        0     0.4   0.4
f(G):        0     0.4   0.2

where f(A) is the frequency of adenine (A) at that alignment position, f(T) the frequency of thymine (T), etc.
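As a minimal sketch, the frequency matrix for the five-sequence example can be computed as follows. The function name `frequency_matrix` is hypothetical, and ignoring gap characters when counting is an assumption made here; the disclosure does not say how gaps enter the frequencies.

```python
from collections import Counter

BASES = "ATCG"


def frequency_matrix(alignment):
    """Return one {base: frequency} dict per alignment position.

    Assumption: characters other than A, T, C, G (e.g. gaps) are ignored.
    """
    matrix = []
    for column in zip(*alignment):
        counts = Counter(b for b in column if b in BASES)
        total = sum(counts.values())
        matrix.append({b: (counts[b] / total if total else 0.0) for b in BASES})
    return matrix


# The five aligned reference sequences from the example in Part 3b.
example = ["ACT", "ACA", "TGC", "AGC", "ATG"]
freqs = frequency_matrix(example)
for base in BASES:
    print(f"f({base}):", [col[base] for col in freqs])
```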

Part 3c: Create Information Matrix.

Using the frequency matrix, an information matrix may be created containing the amount of “information” provided by a given base i at each alignment position j. In this case, information may be defined as the decrease in Shannon uncertainty, calculated as:

I_(i,j) = 2 + log₂(p_(i,j))

where p_(i,j) is the frequency of base i at alignment position j.

In the instant example, the following information matrix may be obtained:

Position #   1       2       3
f(A):        0.8     0       0.2
f(T):        0.2     0.2     0.2
f(C):        0       0.4     0.4
f(G):        0       0.4     0.2
I(A):        1.68    −∞      −0.32
I(T):        −0.32   −0.32   −0.32
I(C):        −∞      0.68    0.68
I(G):        −∞      0.68    −0.32
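The information values can be reproduced with a short Python sketch of the formula I = 2 + log₂(p). The helper names are hypothetical. A base with zero frequency is given negative infinity here, which is the mathematical limit of the formula as p approaches 0; how such entries should be handled is otherwise an assumption.

```python
import math

BASES = "ATCG"


def information(p):
    """Decrease in Shannon uncertainty, I = 2 + log2(p); -inf when p is 0."""
    return 2 + math.log2(p) if p > 0 else float("-inf")


def information_matrix(freqs):
    """Apply information() to every entry of a frequency matrix."""
    return [{b: information(col[b]) for b in BASES} for col in freqs]


# Frequency matrix from the example in Part 3b, one dict per position.
freqs = [
    {"A": 0.8, "T": 0.2, "C": 0.0, "G": 0.0},
    {"A": 0.0, "T": 0.2, "C": 0.4, "G": 0.4},
    {"A": 0.2, "T": 0.2, "C": 0.4, "G": 0.2},
]
info = information_matrix(freqs)
for base in BASES:
    print(f"I({base}):", [f"{col[base]:.2f}" for col in info])
```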

Part 3d: Determine an Information Score for Each Reference Sequence.

The information matrix determined in Part 3c describes the amount of information provided by a given nucleotide at each alignment position. By summing these information contributions along the entire length of a given sequence, an information score may be determined. This score may represent the total amount of information encoded into the sequence. Sequences with high scores may contain many bases that are shared with other aligned sequences at that position. Sequences with low scores may be regarded as “unusual”, containing low-frequency bases at many alignment positions.

In the instant example, Sequence 1 may be scored in the following fashion:

Sequence 1   A       C       T
I_(A):       1.68    −∞      −0.32
I_(T):       −0.32   −0.32   −0.32
I_(C):       −∞      0.68    0.68
I_(G):       −∞      0.68    −0.32

Thus, I_(sequence 1) = 1.68 + 0.68 + (−0.32) = 2.04 bits of information.

Since the aligned sequences may be of different lengths, the information content of each sequence may be normalized by the number of base pairs (bp) it contains:

Ī_(sequence 1) = 2.04/3 = 0.68 bits/bp

By applying this logic to the entire multiple sequence alignment, it may be possible to determine which of the reference sequences are most relevant to the consensus sequence.
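A minimal Python sketch of the normalized score for every sequence in the example follows. The function name `information_scores` is hypothetical, and skipping gap characters (so that each sequence is normalized by the number of bases it actually contains) is an assumption about how sequences of different lengths are handled.

```python
import math
from collections import Counter

BASES = "ATCG"


def information_scores(alignment):
    """Normalized information score (bits per base) for each aligned sequence.

    Assumption: gap characters contribute nothing and are excluded from the
    normalizing length.
    """
    columns = [Counter(b for b in col if b in BASES) for col in zip(*alignment)]
    scores = []
    for seq in alignment:
        bits = [
            2 + math.log2(counts[base] / sum(counts.values()))
            for base, counts in zip(seq, columns)
            if base in BASES
        ]
        scores.append(sum(bits) / len(bits) if bits else float("-inf"))
    return scores


example = ["ACT", "ACA", "TGC", "AGC", "ATG"]
scores = information_scores(example)
for number, score in enumerate(scores, start=1):
    print(f"Sequence {number}: {score:.2f} bits/bp")
```

In this example, Sequence 4 receives the highest score because it carries the most common base (or one of the most common bases) at every position.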

The concept of sequence information has been used by others to identify sequence motifs, specific sequences conserved across many genomes that may indicate undiscovered genes, protein-binding sites, or other biochemical or structural information.

Part 3e: Determine Consensus Sequence.

Once information scores have been generated for each reference sequence, the multiple sequence alignment may be reduced to a single consensus sequence. There are many ways to determine a consensus sequence, as known by those having ordinary skill in the relevant art. For instance, a popular way is to select the most frequent base at each position in an alignment. The literature describes a method for determining a consensus sequence where the most frequent base is selected for each position if its frequency is ≥0.875. However, if the frequency is less than that, the consensus base is left undefined for that position. Undefined bases, however, are not suitable for synthesis and incorporation into a molecular standard because there must be a base at every position in the sequence. Thus, the present disclosure uses a novel process.

To determine a consensus sequence, analysis may begin at the first alignment position and move toward the last alignment position. At each alignment position, the following decisions may be made, including:

i. Is the highest frequency of any base greater than 0.7? If yes, assign that base to the consensus sequence and move to the next position. If no, go to step ii.
ii. Look at the alignment of bases at that position. Which one comes from the reference sequence with the lowest Ī score? Remove it from consideration and go to step iii.
iii. After one instance of a base has been removed, recalculate the base frequencies. Go back to step i.

A method for determining a consensus sequence may use the most common base at a given position, even if that base is only slightly more common than the others, e.g. a frequency of 0.26. The cutoff frequency of 0.7 may be selected to balance the consideration given to less common bases. A lower cutoff may give such outliers too much weight while a higher cutoff may give them too little weight.
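Steps i to iii can be sketched in Python as below. This is an illustrative sketch under stated assumptions, not a definitive implementation: the function names are hypothetical, gap characters are ignored, a position where every sequence has a gap is skipped, and ties between equally low-scoring sequences are broken by taking the earliest sequence in the alignment. None of these tie-breaking or gap-handling choices is specified in the disclosure.

```python
import math
from collections import Counter

BASES = "ATCG"


def information_scores(alignment):
    """Normalized information score (bits per base) for each aligned sequence."""
    columns = [Counter(b for b in col if b in BASES) for col in zip(*alignment)]
    scores = []
    for seq in alignment:
        bits = [
            2 + math.log2(counts[base] / sum(counts.values()))
            for base, counts in zip(seq, columns)
            if base in BASES
        ]
        scores.append(sum(bits) / len(bits) if bits else float("-inf"))
    return scores


def build_consensus(alignment, threshold=0.7):
    """Resolve every alignment position to a single base (steps i to iii)."""
    scores = information_scores(alignment)
    result = []
    for column in zip(*alignment):
        # Bases still under consideration here, keyed by sequence index.
        remaining = {i: b for i, b in enumerate(column) if b in BASES}
        while remaining:  # an all-gap column is skipped (assumption)
            base, count = Counter(remaining.values()).most_common(1)[0]
            if count / len(remaining) > threshold:  # step i
                result.append(base)
                break
            # Step ii: drop the base from the lowest-scoring sequence.
            # Scores are rounded so floating-point noise cannot decide ties;
            # the lower sequence index then wins (assumption).
            worst = min(remaining, key=lambda i: (round(scores[i], 9), i))
            del remaining[worst]  # step iii: frequencies are recomputed above
    return "".join(result)


example = ["ACT", "ACA", "TGC", "AGC", "ATG"]
consensus_seq = build_consensus(example)
print(consensus_seq)
```

Because a column with a single remaining base always has a frequency of 1.0, the loop is guaranteed to terminate with a defined base at every position that contains at least one base.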

Part 4: Computationally Verify the Consensus Sequence.

Prior to synthesis, computational tools may be used to predict whether the consensus sequence will behave similarly to one or more of the reference sequences.

Part 4a: In Silico RFLP Digest.

Restriction Fragment Length Polymorphism (RFLP) analysis is a common molecular biology technique for identifying differences between multiple DNA samples. During RFLP analysis, one or more restriction endonucleases may be used to digest the DNA samples of interest. A restriction endonuclease may include an enzyme that cuts DNA at specific recognition sites. For example, the EcoRI enzyme may cut a double-stranded DNA recognition sequence in the following fashion (‘|’ indicates a cut point):

5′ G|AATT C 3′ 3′ C TTAA|G 5′After digestion, the fragments of each DNA sample may be separated bysize using, for example, gel electrophoresis, producing a “fingerprint”that can be used to identify small sequence differences between the DNAsamples. During in silico RFLP digest, a sequence may be computationallysearched for a set of known restriction endonuclease recognition sites,and the number of bases in between the sites may be counted. With thisinformation, a model RFLP fingerprint may be created for that sequence.Ideally, the consensus sequence and reference sequences should have thesame number of RFLP fragments. Also, the corresponding fragments foreach sequence should be approximately the same length. If this is notthe case, the consensus sequence may need to be redesigned.
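The in silico digest described above can be illustrated with a minimal Python sketch. The names `ENZYMES`, `digest`, and `fingerprints_match`, as well as the 5-base tolerance, are hypothetical and chosen only for this example. The sketch assumes a linear sequence, searches the top strand only, and lists EcoRI alone (recognition site GAATTC, cut after the first base, as in the diagram above).

```python
# Recognition site and cut offset on the top strand (EcoRI: G|AATTC).
ENZYMES = {"EcoRI": ("GAATTC", 1)}


def digest(sequence, enzymes=ENZYMES):
    """Return fragment lengths, largest first, for a linear sequence."""
    seq = sequence.upper()
    cuts = set()
    for site, offset in enzymes.values():
        start = seq.find(site)
        while start != -1:
            cuts.add(start + offset)
            start = seq.find(site, start + 1)
    # Fragment boundaries: sequence start, each internal cut, sequence end.
    bounds = [0] + sorted(c for c in cuts if 0 < c < len(seq)) + [len(seq)]
    fragments = [b - a for a, b in zip(bounds, bounds[1:])]
    return sorted(fragments, reverse=True)


def fingerprints_match(frags_a, frags_b, tolerance_bp=5):
    """Same fragment count and approximately equal corresponding lengths."""
    if len(frags_a) != len(frags_b):
        return False
    return all(abs(a - b) <= tolerance_bp for a, b in zip(frags_a, frags_b))
```

A consensus sequence whose fingerprint fails `fingerprints_match` against a reference sequence would be a candidate for redesign under the criteria stated above. A circular construct would need a different boundary treatment, which this sketch does not attempt.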

Part 4b: Primer/Probe Binding Simulation.

To determine if the consensus sequence is a good surrogate for the reference sequences, the behavior of these sequences in previously-published molecular assays may be simulated. Most molecular assays rely upon the use of short pieces of DNA referred to as oligonucleotides. Oligonucleotides may function as PCR primers, guiding the amplification of certain genetic regions. Oligonucleotides may also function as probes, binding to specific regions of the target DNA and emitting a signal of some sort. In both cases, assay performance may be dependent upon the oligonucleotides binding tightly to the target DNA in the correct location.

During computational verification, the way that the oligonucleotides described in previously-published molecular assays bind to the consensus and reference sequences may be simulated. A software tool, such as, for example, VisualOMP, may be used to perform these simulations, although other such tools exist. Ideally, each oligonucleotide should bind to the consensus and reference sequences with the same strength (indicated by the Gibbs free energy of hybridization and melting temperature of the heterodimer) and binding position. If this is not the case, the consensus sequence may need to be redesigned.
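As a rough illustration of the binding-position check only, the following Python sketch scans a template for the window that best matches an oligonucleotide and reports its position and mismatch count. It is not a substitute for the thermodynamic simulation described above: it does not estimate free energy or melting temperature, it scans only the strand supplied, and the function name `best_site` is hypothetical.

```python
def best_site(oligo, template):
    """Return (position, mismatches) of the best ungapped match.

    Both arguments are read 5' to 3' on the same strand; the caller is
    assumed to supply the reverse complement where appropriate.
    """
    oligo, template = oligo.upper(), template.upper()
    if not oligo or len(oligo) > len(template):
        raise ValueError("oligo must be non-empty and fit in the template")
    best = None
    for pos in range(len(template) - len(oligo) + 1):
        window = template[pos:pos + len(oligo)]
        mismatches = sum(a != b for a, b in zip(oligo, window))
        # Keep the first window with the fewest mismatches.
        if best is None or mismatches < best[1]:
            best = (pos, mismatches)
    return best
```

Running `best_site` for each published primer or probe against the consensus and against each reference sequence gives a quick first check: a large difference in mismatch count or in relative position suggests the consensus may not behave like the reference in that assay.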

Part 5. Synthesize the Molecular Standard.

Once a consensus sequence has been identified and computationally verified, it may be synthesized and incorporated into a molecular standard. The resulting molecular standard may include a circular DNA plasmid containing the consensus sequence as an insert. Other types of constructs are contemplated and within the scope of the present disclosure. The use of a plasmid carrier molecule provides stability to the molecular standard, extending shelf life and allowing use in a variety of molecular assays. Creation of the molecular standard may be accomplished by any means known to those skilled in the art, including ordering the standard from a third party manufacturer. One such third party manufacturer is Blue Heron Biotechnology, although there are others.

Part 6. Optimize the PCR Performance of the Molecular Standard.

Even though a molecular standard may share an insert sequence identical to that of a sample of native genomic DNA, the two templates may not behave identically during molecular analysis. Due to the small size of each plasmid, a vial of molecular standard may contain many more copies of the insert per mass of DNA than the genomic material. Genomic material may also contain many orders of magnitude more non-target DNA than the molecular standards. For these reasons, molecular standards may produce much clearer results than native genomic DNA. In the case of PCR-based assays, this means that the genomic DNA may amplify with lower efficiency. This performance bias may be unacceptable if the molecular standards are to serve as surrogates for native genomic material.

To ensure that the molecular standards demonstrate PCR efficiency similar to native genomic material, it may be necessary to introduce PCR inhibitors to the standard solutions. Many such inhibitors exist, such as non-target DNA, humic acids, polysaccharides, bile salts, immunoglobulin G, heme, CTAB, SDS, alcohol, sodium acetate, sodium chloride, EDTA, collagen, melanin, eumelanin, myoglobin, proteinases, calcium ions, urea, lactoferrin, indigo dye, and the like. By introducing one or more of these substances into solution with the molecular standards, the PCR efficiency of the standards may be optimized to match that of native genomic DNA. This process of using PCR inhibitors to adjust the amplification efficiency of the molecular standards may be a key component of guaranteeing their utility.

The process described above results in standardized positive controls for microbial pathogens that may be used to develop and validate molecular detection and genotyping methods. Historically, most efforts in this area have focused on developing PCR inhibition positive controls, also known as internal controls. These typically involve generating either a linear or plasmid DNA template containing a short (<1,000 bp) insert, usually a PCR amplicon, flanked by specific primer recognition sites. These molecules are then spiked into analytical reactions to distinguish “no target” samples (internal control will amplify) from false negatives (internal control will not amplify, typically due to matrix inhibition). This approach has been used in assays for various organisms, including HSV 1 and 2, avian influenza, pestiviruses, Agrobacterium tumefaciens, and others.

Due to small insert size and targeted design, traditional internal controls, such as those listed above, share the same limitation: they are necessarily assay-specific and cannot be used with alternative primer or probe sets. Therefore, research groups developing new detection or genotyping assays must also generate their own internal controls from scratch. In contrast, the molecular standards disclosed herein involve significant portions of target genes, if not the entire sequence. In this way, any test developer targeting a selected gene can use the molecular standards, regardless of their assay's specific primer or probe sequences.

Molecular standards developed according to the disclosure may be used as surrogates for native genomic DNA in a number of applications for various users. Molecular assay developers may use the molecular standards to determine the suitability of PCR primers and probes during development, to determine limit-of-detection specifications during development, or to verify the species-specificity of PCR primers and probes during development. End-users of a molecular assay may use the molecular standards to generate standard curves when quantifying the amount of target in a sample during qPCR, as a positive control to determine if a particular PCR reaction mixture was created and pipetted correctly, as a spike-in control to determine the amount of PCR inhibition present in a sample matrix, as a spike-in control to determine the recovery of a particular DNA extraction technique, or as a genotype control to compare against during genotyping assays, such as RFLP.

EXPERIMENTS

The above process was used to generate consensus sequences for 11 species of Cryptosporidium (C. felis, C. parvum, C. muris, C. hominis, C. meleagridis, C. andersoni, C. baileyi, C. bovis, C. canis, C. serpentis, and C. wrairi), 1 species of Giardia (G. intestinalis), and 2 species of microsporidia (Encephalitozoon intestinalis and Enterocytozoon bieneusi). Additionally, the above process was used to synthesize standards for the 18s rRNA genes of 5 species of Cryptosporidium (C. hominis, C. parvum, C. muris, C. meleagridis, and C. felis) and 1 species of Giardia (G. intestinalis). The PCR performance of these standards was then compared against that of genomic DNA from the target organisms using real-time PCR assays.

Prior to PCR amplification, synthetic standards were manufactured by a third party. Native C. parvum, C. hominis, C. meleagridis, C. muris and G. intestinalis DNA was purchased from either ATCC or BEI Resources. Sample DNA concentrations (ng/μl) were converted to CN concentrations (CN/μl) based upon the CN densities calculated previously. Synthetic and native DNA samples were serially diluted in TE (10 mM Tris-HCl, 1 mM EDTA, pH 8.0) to obtain concentrations between 10⁷ and 10⁰ CN/μl.
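The mass-to-copy-number conversion and the dilution series can be sketched as follows. This is an illustrative calculation, not the CN density method used in the disclosure, which is described elsewhere. It assumes an average mass of 650 g/mol per base pair of double-stranded DNA (a common approximation), one target copy per template molecule, and hypothetical function names; the 4,000 bp template length in the usage note is likewise invented for illustration.

```python
AVOGADRO = 6.02214076e23   # molecules per mole
BP_MASS = 650.0            # assumed average g/mol per base pair of dsDNA


def copies_per_ng(template_length_bp):
    """Template copies in 1 ng of DNA, assuming one copy per molecule."""
    if template_length_bp <= 0:
        raise ValueError("template length must be positive")
    grams_per_copy = template_length_bp * BP_MASS / AVOGADRO
    return 1e-9 / grams_per_copy


def ng_to_copies(conc_ng_per_ul, template_length_bp):
    """Convert a concentration in ng/ul to copies (CN) per ul."""
    return conc_ng_per_ul * copies_per_ng(template_length_bp)


def dilution_series(start_cn_per_ul, steps, factor=10):
    """Concentrations of a serial dilution, starting with the stock."""
    return [start_cn_per_ul / factor**i for i in range(steps)]
```

For a hypothetical 4,000 bp plasmid, `copies_per_ng(4000)` is on the order of 10⁸ copies per nanogram, and `dilution_series(1e7, 8)` spans 10⁷ down to 10⁰ CN/μl in tenfold steps.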

PCR amplification was performed in triplicate using a real-time PCR assay optimized for the LightCycler 2.0. PCR amplification was confirmed by gel electrophoresis using the Invitrogen E-Gel® Ex (2% agarose) system.

FIGS. 1-6 show standard curves for each species. The standard curves were calculated by plotting PCR threshold cycle values (Ct) versus log₁₀ (CN/rxn) for each DNA type. Linearity (R²) and efficiency were calculated over at least 4 orders of magnitude using linear regression.
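A standard curve of this kind can be fitted with ordinary least squares. The sketch below is illustrative and uses a hypothetical function name. It assumes the widely used relation between slope and amplification efficiency, E = 10^(−1/slope) − 1, under which a slope of about −3.32 corresponds to 100% efficiency (a doubling every cycle); the disclosure does not state which formula was applied.

```python
def standard_curve(log10_copies, ct_values):
    """Fit Ct = slope * log10(CN/rxn) + intercept.

    Returns (slope, intercept, r_squared, efficiency), where efficiency
    is a fraction (1.0 means 100%).
    """
    n = len(log10_copies)
    if n != len(ct_values) or n < 3:
        raise ValueError("need at least three paired points")
    mean_x = sum(log10_copies) / n
    mean_y = sum(ct_values) / n
    sxx = sum((x - mean_x) ** 2 for x in log10_copies)
    syy = sum((y - mean_y) ** 2 for y in ct_values)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(log10_copies, ct_values))
    if sxx == 0 or syy == 0:
        raise ValueError("points must vary in both copies and Ct")
    slope = sxy / sxx
    if slope >= 0:
        raise ValueError("Ct should decrease as copy number increases")
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    efficiency = 10 ** (-1 / slope) - 1
    return slope, intercept, r_squared, efficiency
```

Fitting the synthetic and native dilution series separately and comparing the two efficiencies and R² values gives the kind of side-by-side comparison summarized in Table 1.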

Standard curves generated using synthetic or native DNA demonstrated nearly identical R² values of approximately 0.99. For each species, the synthetic and native DNA demonstrated substantially equivalent PCR performance (Table 1), demonstrating that either could be used to construct an accurate standard curve for target quantification.

While the present disclosure has been described in terms of exemplary embodiments, those skilled in the art will recognize that the present disclosure can be practiced with (or without) modifications in the spirit and scope of the appended claims. The examples disclosed herein are merely illustrative and are not meant to be an exhaustive list of all possible designs, embodiments, applications or modifications of the present disclosure.

TABLE 1 Limit of Detection, Efficiency, and Linearity

Organism         Type        LOD (CN/rxn)   Efficiency (%)   R²        Δ Eff    Δ Tm
C. hominis       Genomic     10             101.82%          0.99229   2.07%    0.09
                 Synthetic   10             103.89%          0.99466
C. parvum        Genomic     10             95.34%           0.99397   6.57%    0.31
                 Synthetic   10             101.91%          0.99355
C. meleagridis   Genomic     10             97.31%           0.99513   1.83%    −0.15
                 Synthetic   10             95.49%           0.99677
C. muris         Genomic     N/A            115.78%          0.99239   7.04%    N/A
                 Synthetic   10             108.73%          0.99074
C. felis         Synthetic   10             107.90%          0.99691   N/A      N/A
G. lamblia       Genomic     5500           135.89%          0.98598   8.11%
                 Synthetic   5500           127.79%          0.98832

What is claimed is:
 1. A synthetic nucleic acid molecule having a consensus sequence constructed from an alignment of a plurality of nucleic acid sequences from microbial pathogens, the consensus sequence constructed by a method comprising: generating a frequency matrix comprising the frequency of each base at each position in the alignment; creating an information matrix comprising the amount of information provided by each base at each position in the alignment; and calculating an information score for each of the plurality of nucleic acid sequences; iterating over each position in the alignment and for each position: calculating base frequencies and determining a highest frequency base; if the highest frequency base's frequency is higher than a frequency threshold, assigning the highest frequency base to the consensus sequence; and if the highest frequency base's frequency is lower than the frequency threshold, removing a base corresponding to a nucleic acid sequence with a lowest information score and returning to (a); and performing a restriction fragment length polymorphism (RFLP) fingerprint of the constructed consensus sequence in silico.
 2. The synthetic nucleic acid molecule according to claim 1, comprising the 18S rRNA consensus sequence for Cryptosporidium andersoni shown in SEQ ID NO:1.
 3. The synthetic nucleic acid molecule according to claim 1, comprising the 18S rRNA consensus sequence for Cryptosporidium baileyi shown in SEQ ID NO:2.
 4. The synthetic nucleic acid molecule according to claim 1, comprising the 18S rRNA consensus sequence for Cryptosporidium bovis shown in SEQ ID NO:3.
 5. The synthetic nucleic acid molecule according to claim 1, comprising the 18S rRNA consensus sequence for Cryptosporidium canis shown in SEQ ID NO:4.
 6. The synthetic nucleic acid molecule according to claim 1, comprising the 18S rRNA consensus sequence for Cryptosporidium felis shown in SEQ ID NO:5.
 7. The synthetic nucleic acid molecule according to claim 1, comprising the 18S rRNA consensus sequence for Cryptosporidium hominis shown in SEQ ID NO:6.
 8. The synthetic nucleic acid molecule according to claim 1, comprising the 18S rRNA consensus sequence for Cryptosporidium meleagridis shown in SEQ ID NO:7.
 9. The synthetic nucleic acid molecule according to claim 1, comprising the 18S rRNA consensus sequence for Cryptosporidium muris shown in SEQ ID NO:8.
 10. The synthetic nucleic acid molecule according to claim 1, comprising the 18S rRNA consensus sequence for Cryptosporidium parvum shown in SEQ ID NO:9.
 11. The synthetic nucleic acid molecule according to claim 1, comprising the 18S rRNA consensus sequence for Cryptosporidium serpentis shown in SEQ ID NO:10.
 12. The synthetic nucleic acid molecule according to claim 1, comprising the 18S rRNA consensus sequence for Cryptosporidium wrairi shown in SEQ ID NO:11.
 13. The synthetic nucleic acid molecule according to claim 1, comprising the 18S rRNA consensus sequence for Giardia intestinalis shown in SEQ ID NO:12.
 14. The synthetic nucleic acid molecule according to claim 1, comprising the 18S rRNA consensus sequence for Encephalitozoon intestinalis shown in SEQ ID NO:13.
 15. The synthetic nucleic acid molecule according to claim 1, comprising the 18S rRNA consensus sequence for Enterocytozoon bieneusi shown in SEQ ID NO:14.
 16. A synthetic nucleic acid molecule having a consensus sequence selected from the group consisting of: an 18S rRNA consensus sequence for Cryptosporidium andersoni, as shown in SEQ ID NO:1; an 18S rRNA consensus sequence for Cryptosporidium baileyi, as shown in SEQ ID NO:2; an 18S rRNA consensus sequence for Cryptosporidium bovis, as shown in SEQ ID NO:3; an 18S rRNA consensus sequence for Cryptosporidium canis, as shown in SEQ ID NO:4; an 18S rRNA consensus sequence for Cryptosporidium felis, as shown in SEQ ID NO:5; an 18S rRNA consensus sequence for Cryptosporidium hominis, as shown in SEQ ID NO:6; an 18S rRNA consensus sequence for Cryptosporidium meleagridis, as shown in SEQ ID NO:7; an 18S rRNA consensus sequence for Cryptosporidium muris, as shown in SEQ ID NO:8; an 18S rRNA consensus sequence for Cryptosporidium parvum, as shown in SEQ ID NO:9; an 18S rRNA consensus sequence for Cryptosporidium serpentis, as shown in SEQ ID NO:10; an 18S rRNA consensus sequence for Cryptosporidium wrairi, as shown in SEQ ID NO:11; an 18S rRNA consensus sequence for Giardia intestinalis, as shown in SEQ ID NO:12; an 18S rRNA consensus sequence for Encephalitozoon intestinalis, as shown in SEQ ID NO:13; and an 18S rRNA consensus sequence for Enterocytozoon bieneusi, as shown in SEQ ID NO:14.
 17. A synthetic nucleic acid construct containing the synthetic nucleic acid having a consensus sequence as in claim 16.
 18. The synthetic nucleic acid construct of claim 17, wherein the construct is linear.
 19. The synthetic nucleic acid construct of claim 17, wherein the construct is a circular plasmid.
 20. A microbial pathogen test kit comprising: a synthetic nucleic acid molecule having a consensus sequence constructed from an alignment of a plurality of nucleic acid sequences from microbial pathogens, the consensus sequence constructed by a method comprising: generating a frequency matrix comprising the frequency of each base at each position in the alignment; creating an information matrix comprising the amount of information provided by each base at each position in the alignment; and calculating an information score for each of the plurality of nucleic acid sequences; iterating over each position in the alignment and for each position: calculating base frequencies and determining a highest frequency base; if the highest frequency base's frequency is higher than a frequency threshold, assigning the highest frequency base to the consensus sequence; and if the highest frequency base's frequency is lower than the frequency threshold, removing a base corresponding to a nucleic acid sequence with a lowest information score and returning to (a); and performing a restriction fragment length polymorphism (RFLP) fingerprint of the constructed consensus sequence in silico.